
Retain underlying error on proxy mode connection failure #94998

Merged
elasticsearchmachine merged 20 commits into elastic:main from ywangd:rcs-proxy-mode-error-message on Apr 25, 2023
Conversation

@ywangd (Member) commented Apr 4, 2023

When a remote cluster node fails to connect in sniff mode, the
underlying error is preserved and reported. But in proxy mode, the
underlying error is swallowed and it reports only a generic "Unable to
open any proxy connections" error.

This PR improves the error reporting for proxy mode to align with that
of sniff mode. This allows consistent error messages in the API
response and clearer diagnostics for underlying issues, e.g.
authentication failure, incompatible license, etc.
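The fix can be sketched as follows. This is a minimal, self-contained illustration, not the actual Elasticsearch code; names such as RetryReporter and MAX_ATTEMPTS are hypothetical. The key point matches the PR: when attempts are exhausted, the last underlying exception becomes the cause of the final error instead of being swallowed.

```java
// Minimal sketch of cause-preserving retry failure reporting.
// RetryReporter and MAX_ATTEMPTS are illustrative names, not the
// actual Elasticsearch API.
class RetryReporter {
    static final int MAX_ATTEMPTS = 3;

    // Called on each failed connection attempt. Once attempts are
    // exhausted, the last underlying error is wrapped (not swallowed)
    // so callers can see the real cause, e.g. an authentication or
    // license failure. Returns null when the caller should retry.
    static Exception onAttemptFailed(int attemptNumber, Exception cause) {
        if (attemptNumber >= MAX_ATTEMPTS) {
            return new RuntimeException("Unable to open any proxy connections", cause);
        }
        return null; // retry
    }
}
```

The difference a caller sees is that `getCause()` on the final exception now returns the real failure rather than null.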

ywangd added 2 commits April 4, 2023 14:15
@ywangd ywangd added >enhancement :Distributed/Network Http and internode communication implementations :Security/Security Security issues without another label v8.8.0 labels Apr 4, 2023
@ywangd ywangd requested review from Tim-Brooks and n1v0lg April 4, 2023 05:35
@elasticsearchmachine (Collaborator)

Hi @ywangd, I've created a changelog YAML for you.

@elasticsearchmachine elasticsearchmachine added Team:Distributed Meta label for distributed team. Team:Security Meta label for security team labels Apr 4, 2023
@elasticsearchmachine (Collaborator)

Pinging @elastic/es-security (Team:Security)

@elasticsearchmachine (Collaborator)

Pinging @elastic/es-distributed (Team:Distributed)

Comment on lines -263 to +267
-            openConnections(finished, attemptNumber + 1);
+            if (attemptNumber >= MAX_CONNECT_ATTEMPTS_PER_RUN && connectionManager.size() == 0) {
+                logger.warn(() -> "failed to open any proxy connections to cluster [" + clusterAlias + "]", e);
+                finished.onFailure(new NoSeedNodeLeftException(strategyType(), clusterAlias, e));
+            } else {
+                openConnections(finished, attemptNumber + 1);
Member Author (ywangd):

This change basically mirrors how errors are handled in sniff mode so that the underlying cause of the failure is retained and reported.

Contributor:

With this change, do we still need to check connectionManager.size() in the else branch below? I.e. can we remove this cause-free exception?

finished.onFailure(new NoSeedNodeLeftException(strategyType(), clusterAlias));

Or maybe we should have the attemptNumber >= MAX_CONNECT_ATTEMPTS_PER_RUN check in the onResponse handler above too.

Member Author (ywangd):

With this change, do we still need to check connectionManager.size() in the else branch below?

In practice, we should not need it. But in theory, it is possible that the if (attemptNumber <= MAX_CONNECT_ATTEMPTS_PER_RUN) check fails in the first invocation and goes to the else branch.

We have a similar else branch in SniffConnectionStrategy as well. Therefore I didn't remove it.

Contributor:

Right, yes, but since this only affects the first invocation (and only if there are no seed nodes or some other weird config?) I think it'd be better to handle it up-front. That would clarify that once we get here we never return a cause-free NoSeedNodeLeftException.

         }
         logger.warn(() -> "fetching nodes from external cluster [" + clusterAlias + "] failed", e);
-        listener.onFailure(e);
+        listener.onFailure(new NoSeedNodeLeftException(strategyType(), clusterAlias, e));
Member Author (ywangd):

For consistency between the two modes, I am also updating this to be wrapped in a NoSeedNodeLeftException, which makes the error message more explanatory in the REST response.

Contributor:

I think this only makes sense if seedNodesSuppliers.hasNext() == false. If we encountered some other more fundamental error then we didn't run out of seed nodes here.

Also IMO it would be good to eliminate the seedNodesSuppliers.hasNext() == false case entirely in this method so that there's always a cause to report.

Member Author (ywangd):

I think this only makes sense if seedNodesSuppliers.hasNext() == false
Good catch! Will fix.

Also IMO it would be good to eliminate the seedNodesSuppliers.hasNext() == false case entirely in this method so that there's always a cause to report.

Sorry but I don't quite follow. There is no existing seedNodesSuppliers.hasNext() == false case. Are you referring to the else branch of the opening if (seedNodesSuppliers.hasNext()), i.e. related to the above comment? I think in practice the else branch never runs, so the cause is always reported. Please let me know if you'd like to drop the else branch for both the sniff and proxy strategies; I can make the change to both. I just want them to be consistent.

Contributor:

Yes I meant the else branch. I wonder if it could run if seedNodesSuppliers is empty. Not sure if this can happen but maybe it could in future even if not today? I'd like to protect against that at least, but again I think it's better to do so up-front rather than on every iteration.
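The up-front guard being suggested could look roughly like this. This is an illustrative sketch with hypothetical names (SeedNodeGuard), not the actual strategy code: by rejecting an empty iterator before the retry loop starts, every later failure is guaranteed to carry a concrete cause.

```java
import java.util.Iterator;

// Sketch of guarding against an empty seed-node iterator up front, so
// that a cause-free failure can never surface from deep inside the
// retry loop. SeedNodeGuard is an illustrative name.
class SeedNodeGuard {
    static String connect(Iterator<String> seedNodes) {
        if (!seedNodes.hasNext()) {
            // Fail fast with an explicit message instead of returning a
            // cause-free exception after iterating.
            throw new IllegalStateException("no seed nodes configured");
        }
        // ... proceed, knowing every subsequent failure has a real cause
        return seedNodes.next();
    }
}
```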

Comment on lines +470 to +476
-        expectThrows(Exception.class, connectFuture::actionGet);
+        final NoSeedNodeLeftException exception = (NoSeedNodeLeftException) expectThrows(
+            ExecutionException.class,
+            connectFuture::get
+        ).getCause();
Member Author (ywangd):

Sniff mode was throwing a ConnectTransportException directly. As commented above, I changed it to be wrapped inside a NoSeedNodeLeftException.

@DaveCTurner (Contributor) left a comment:

Makes sense to me. I left a couple of inline thoughts. Moreover:

  • I've seen cases where the last error wasn't the useful one - could we do more to report the other errors encountered along the way (maybe collecting them as suppressed exceptions)?

  • If the inner exception is a ConnectTransportException then it will report the address to which it was trying to connect, but other exceptions will not include this, and I think that makes further investigation much harder. Could we make sure that we always include the target address?
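The "collecting them as suppressed exceptions" idea could be sketched like this. SuppressedCollector is a hypothetical helper, not an Elasticsearch class; it shows the standard Java mechanism (Throwable.addSuppressed) for reporting every attempt's failure rather than only the last one.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of collecting per-attempt failures as suppressed
// exceptions so the final error reports every attempt, not just the
// last one. SuppressedCollector is a hypothetical name.
class SuppressedCollector {
    private final List<Exception> failures = new ArrayList<>();

    void recordFailure(Exception e) {
        failures.add(e);
    }

    // Build the final exception: the last failure is the cause, and the
    // earlier ones are attached as suppressed so nothing encountered
    // along the way is lost.
    Exception buildFinalException(String message) {
        Exception last = failures.get(failures.size() - 1);
        Exception result = new RuntimeException(message, last);
        for (int i = 0; i < failures.size() - 1; i++) {
            result.addSuppressed(failures.get(i));
        }
        return result;
    }
}
```

Suppressed exceptions also print automatically in stack traces, so the extra context shows up in logs without further work.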


@ywangd ywangd requested a review from DaveCTurner April 6, 2023 09:04
@ywangd (Member Author) commented Apr 6, 2023

@DaveCTurner I pushed an update that should address your comments. Could you please take another look to see whether the changes match your suggestion? Thanks!

.put("cluster.remote.invalid_remote.mode", "proxy")
.put("cluster.remote.invalid_remote.proxy_address", fulfillingCluster.getRemoteClusterServerEndpoint(0))
.build()
randomBoolean() && false
Contributor:

The && false is maybe a leftover from a debugging session?

if (seedNodesSuppliers.hasNext()) {
listener.onFailure(getNoSeedNodeLeftException(Set.of(e)));
} else {
listener.onFailure(e);
Contributor:

Shouldn't this be flipped? I.e., if there are no seed nodes left, we want a NoSeedNodeLeftException or no?

Contributor:

Went back to the original thread and I see the intention now: if we hit an exception that we think we can't recover from by moving on to a different seed node, we don't want to wrap in NoSeedNodeLeftException here.

Given that, the behavior is still confusing though: suppose we have two seed nodes, and we fail with a ConnectTransportException on the first one, then with a "terminating" exception on the second node. In that case, we don't wrap the exception. However, if we had three seed nodes, and failed on the second with a "terminating" exception, we would wrap.

Sorry if I'm missing something here! If I am, it would be great to include a comment in the code with a little more context.

As a side note: not wrapping on "terminating" exceptions is what seems to produce the inconsistency in status codes between proxy mode and sniff mode (e.g., for the license failures, we get a 500 in proxy mode, but a 403 in sniff mode). I'm wondering if we should introduce a second class of failure that still produces a 500 for the case where we don't want to return a NoSeedNodeLeftException.

@ywangd ywangd requested a review from n1v0lg April 11, 2023 07:11
@ywangd (Member Author) commented Apr 11, 2023

Just pushed an update to revert some changes to SniffConnectionStrategy. I think it's better to focus on proxy mode in this PR as suggested in the title.

@ywangd (Member Author) commented Apr 17, 2023

@DaveCTurner @n1v0lg Friendly reminder for review at your convenience. As commented above, I have decided to take a minimal approach with this PR to focus on its primary goal. Happy to work on follow-ups once this one is through.

@n1v0lg (Contributor) left a comment:

LGTM. Sorry for the delay!

From my end, I'm good with only updating proxy mode, and leaving out the sniff mode improvements. Likewise, good with not further refactoring proxy connection strategy in this PR.

}
}

private NoSeedNodeLeftException getNoSeedNodeLeftException(Collection<Exception> suppressedExceptions) {
Contributor:

Since we're only ever passing an empty set here, we could just remove the suppressedExceptions arg altogether.

Member Author (ywangd):

Since I have reduced the scope of this PR, this method is no longer necessary. I inlined it, which also removes the need for suppressedExceptions. Thanks for spotting it.

if (countDown.countDown()) {
openConnections(finished, attemptNumber + 1);
if (attemptNumber >= MAX_CONNECT_ATTEMPTS_PER_RUN && connectionManager.size() == 0) {
logger.warn(() -> "failed to open any proxy connections to cluster [" + clusterAlias + "]", e);
Contributor:

Sanity checking that warn is the right log level: this could be caused by longer-lasting but still temporary network glitches, so I'm wondering if we want info or even debug here instead.

Contributor:

warn is good. info shouldn't really be used to report exceptions, and debug is hidden by default so will make troubleshooting too difficult. See these docs for more info.

@n1v0lg (Contributor) commented Apr 20, 2023:

cheers for the pointer, all good to keep this at warn here!

@ywangd (Member Author) commented Apr 20, 2023

@elasticmachine update branch

@ywangd (Member Author) commented Apr 21, 2023

@DaveCTurner Do you still plan to review the PR? I'd like to get it merged before 8.8 FF. Thanks!

@ywangd (Member Author) commented Apr 24, 2023

@elasticmachine update branch

@ywangd ywangd added the auto-merge-without-approval Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!) label Apr 24, 2023
@ywangd (Member Author) commented Apr 24, 2023

I'll be merging this PR. Happy to work on follow-ups as discussed.

@ywangd (Member Author) commented Apr 24, 2023

Failure is #95514

@ywangd (Member Author) commented Apr 25, 2023

@elasticmachine update branch

@ywangd (Member Author) commented Apr 25, 2023

@elasticmachine update branch

@ywangd (Member Author) commented Apr 25, 2023

@elasticmachine run elasticsearch-ci/docs-check

@elasticsearchmachine elasticsearchmachine merged commit 4d8d412 into elastic:main Apr 25, 2023
@ywangd ywangd deleted the rcs-proxy-mode-error-message branch April 25, 2023 05:17
ywangd added a commit to ywangd/elasticsearch that referenced this pull request Sep 1, 2023
This PR removes a bogus assertion from ProxyConnectionStrategy. The
statement tries to assert that all connection exceptions are caught in
the if branch and that we should have at least one connection when the
code reaches the else branch.

This is simply not true, because checking whether we should open more
connections and subsequently opening connections are *not* atomic. It
is possible that the number of connections changes in between, e.g. the
remote end drops a connection, which makes the code flow go to the else
branch and trips the bogus assertion.

Relates: elastic#94998
Resolves: elastic#99113
elasticsearchmachine pushed a commit that referenced this pull request Sep 12, 2023
Today, when the number of attempts is exhausted, ProxyConnectionStrategy
checks the number of connections before returning. It reports connection
failure if the number of connections is zero at the time of checking.
However, this behaviour is incorrect. In rare cases, a connection can be
dropped right after it is initially established and before the count is
checked. From the perspective of the `openConnections` method, it
should not care whether or when opened connections are subsequently
closed. As long as connections have been initially established, it
should report success instead of failure.

This PR adjusts the code to report success in above situation.

Relates: #94998 Resolves: #99113
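A minimal sketch of the behaviour this follow-up describes, using hypothetical names (ConnectRunResult): success is determined by whether any connection was established during the run, not by the live connection count, which can race with the remote end closing connections.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

// Sketch of the fix described above: remember that a connection *was*
// established during this run, instead of re-checking the live
// connection count, which can race with the remote end closing
// connections. ConnectRunResult is an illustrative name.
class ConnectRunResult {
    private final AtomicBoolean establishedAny = new AtomicBoolean(false);
    private final AtomicInteger openConnections = new AtomicInteger(0);

    void onConnectionOpened() {
        establishedAny.set(true);
        openConnections.incrementAndGet();
    }

    void onConnectionClosed() {
        openConnections.decrementAndGet();
    }

    // Success iff at least one connection was opened during the run,
    // even if the remote end has since dropped it.
    boolean runSucceeded() {
        return establishedAny.get();
    }
}
```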

Labels

auto-merge-without-approval Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!) :Distributed/Network Http and internode communication implementations >enhancement :Security/Security Security issues without another label Team:Distributed Meta label for distributed team. Team:Security Meta label for security team v8.8.0

5 participants